A named entity (NE) is a proper name that designates a person, location, or
organization. For humans, named entity recognition (NER) is a
straightforward process because many named entities are proper names, most
of which begin with a capital letter and are easily recognized, but it is
very difficult for machines. This study discusses research trends in the
application of NER to Indonesian datasets, particularly with regard to
tasks, datasets, methods/techniques, and entity labels. By conducting a
systematic literature review (SLR) and a bibliometric analysis with
VOSviewer, this article aims to provide opportunities for adopting existing
methods, combining models from previous research, and proposing
new methods. A further motivation for conducting an SLR on NER is to look
for new strategies in the supervision of financial technology (Fintech). If
machines can find illegal Fintech entities on social media and in online news, they
can help the government to block these entities. To this end,
this study provides an overview of research trends in applying the NER
method to Bahasa Indonesia (Indonesian) datasets, including the extraction
of news articles and the monitoring of floods and traffic.
1. INTRODUCTION

The named entity (NE) was introduced at the sixth Message Understanding Conference (MUC-6). With
the introduction of the NE, the MUC conferences helped to advance the field of information extraction [1]. An NE
is a proper name that designates a person, location, or organization. For example, there are three NEs in
the following sentence: “James is a doctoral student in the Faculty of Computer Science at the University of
Indonesia.” James is an NE insofar as it is the name of a person (P); Indonesia refers to a location (L); and the
Faculty of Computer Science refers to an organization (O). Named entity recognition (NER) is a procedure
that automatically finds, extracts, and classifies named entities in open-domain, unstructured texts such
as newspaper articles, assigning each NE to a predefined type [2]. There are four approaches to
NER: i) a rule-based approach, which does not require annotated data because it relies on hand-crafted rules; ii) an
unsupervised learning approach; iii) a feature-based supervised learning approach that relies on supervised
learning algorithms with careful feature engineering; and iv) a deep-learning-based approach, which
automatically finds the required representation for detecting or classifying raw input in an end-to-end manner
[3], [4]. NER is a straightforward process for humans because many named entities are proper names, most
of which begin with a capital letter and are easily recognized, but for machines it is very difficult [5].
Information extraction often uses data available on social media, online news, and e-commerce [3].
Much information can thereby be obtained, including product reviews and analyses.
For example, NER has been applied to Indonesian news articles [6]. NER has also been used to
extract comments related to flood monitoring and traffic monitoring [7], [8], and it is
useful for quote identification [9]. The language of the text often
determines which model is used [10], because not all libraries are available for specific tasks [2].

NER has been applied to a wide variety of tasks [3], but a brief survey of the application of NER to 
texts in the Indonesian language reveals a total of only 241 documents (accessed December 2021). 
Meanwhile, the need to perform NER with Indonesian datasets is continuing to grow. Currently, there are 
libraries and tools available to facilitate machine learning (ML) as it pertains to the use of NE to extract 
information, but are there enough datasets? To what extent is NER used to extract information on social 
media and online news in an Indonesian-language context? Not all natural language processing (NLP) 
functions are available in Indonesian because, unlike in English, the functions that rely on the ML model 
mentioned above are not directly supported [11], [12]. 

Another motivation for conducting this SLR is the emergence of problems with illegal financial
technology (Fintech) [13], [14]. Several previous studies on Fintech have been carried out, and one
way to address the main problem is to monitor entities on social media [15]. The challenge is that
corpora (text collections) are not available in all languages.

This study looks at research trends in the application of NER to Indonesian datasets, including
specific tasks, datasets, methods/techniques, and entity labels. The article thereby helps to facilitate the
design of experiments to extract Fintech information from social media and online news. The hope is that the
approach will apply not only to Fintech platforms but can also serve as a proposal for the supervision of
agencies or organizations based on social media data and online news.


2. METHOD 
2.1. Systematic literature review 

First, this article presents an SLR of the field of NER research. An SLR aims to collect all research on
a particular topic, evaluate it critically, and reach conclusions that synthesize that research. A
discussion of how NER has been applied to Indonesian texts then follows. SLRs have been used in various research
domains such as P2P lending [13], Fintech [14], teaching and learning via webinars [16], supply chain management
models [17], and software engineering [18].

The SLR was carried out in three stages: the planning stage, the implementation stage, and the
reporting stage (see Figure 1). The planning stage identifies the need for a
systematic review of the application of NER to Indonesian datasets. At this stage, a review protocol
was also developed by setting the research question (RQ) and formulating a boolean search string to determine
search keywords. This study used the population, intervention, comparison, outcomes, and context (PICOC) strategy
to determine the RQ, as shown in Table 1.


[Figure 1: flow diagram of the three SLR stages. Planning: identification of SLR needs; development of a
review protocol. Implementation: identification of research; selection of primary studies; data extraction
and data synthesis. Reporting: writing.]

Figure 1. Steps for an SLR


Table 1. Criteria of research question 


Population (P) Name-entity recognition, NER, name entity recognition 
Intervention (I) Online news, social media 
Comparison (C) n/a 
Outcomes (O) Trend and application of NER in Indonesian context 
Context (C) Bahasa, Indonesia 


The following RQ guided the analysis:
RQ. “What are the trends in the application of NER to extract information from Indonesian online news and
social media?”
In this study, the search string is (“named-entity recognition” OR NER OR “named entity
recognition”) AND (“online news” OR “social media”) AND (Indonesia* OR Bahasa). In line with the
research question, the inclusion and exclusion criteria in Table 2 were used to filter the results. In the
second stage, this research defines a search strategy: selecting publication databases, selecting
studies, extracting data, and synthesizing the results. These processes are sequential, and
each aims to find the right studies for this research. Search and selection form an
elimination process based on the criteria specified at each step.
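
As a rough illustration, the boolean search string above can be approximated by a small matcher applied to titles and abstracts. This is a sketch under stated assumptions, not the query engine of any of the databases used: the function name `matches_query`, the whole-word matching of NER, and the sample records are all hypothetical.

```python
import re

# Each inner list is an OR-group; the groups are combined with AND,
# mirroring the structure of the search string used in this study.
QUERY_GROUPS = [
    [r"named-entity recognition", r"\bner\b", r"named entity recognition"],
    [r"online news", r"social media"],
    [r"\bindonesia\w*", r"\bbahasa\b"],  # Indonesia* is treated as a prefix wildcard
]

def matches_query(text: str) -> bool:
    """Return True if every AND-group has at least one OR-term in the text."""
    lowered = text.lower()
    return all(
        any(re.search(pattern, lowered) for pattern in group)
        for group in QUERY_GROUPS
    )

# Hypothetical title/abstract snippets, for illustration only.
records = [
    "Named entity recognition on Indonesian social media posts",
    "NER for English online news",
    "Sentiment analysis of Bahasa tweets on social media",
]
selected = [r for r in records if matches_query(r)]
```

Only the first record satisfies all three groups; the second lacks a language term and the third lacks an NER term.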

The authors collected papers from four relevant electronic databases, namely SCOPUS, ACM,
IEEE Xplore, and ScienceDirect, then used Mendeley software to organize the data. Irrelevant papers
were omitted in the first round of selection based on title and abstract. The second round of selection
was based on the full text. Figure 2 illustrates the selection procedure. The total number of papers
obtained from the four databases was initially 241. Upon completion of the selection procedure, however,
only 20 papers remained. The low number of papers is both a challenge to and an opportunity for NER
research in an Indonesian context, as few studies have used Indonesian (“Bahasa”) datasets. The third stage is
reporting and analyzing the results of this review. We mapped the results of previous studies,
examined how NER experiments were conducted, which libraries can be used for Indonesian-language
datasets, and how NER was approached, and we propose future research.


Table 2. Criteria of selection studies process 


Inclusion criteria | Exclusion criteria
The paper studies NER | The paper is not written in English
The study was published in the last 5 years, between 2016 and 2021 | The paper is not available in full text
The paper is a journal article or a proceedings/conference paper | The same paper appears in different databases
 | The paper discusses NER but does not use an Indonesian text dataset


[Figure 2: flow diagram of the selection procedure.
Potentially related papers (241): ACM (26), IEEE (7), ScienceDirect (58), and SCOPUS (150).
Excluded based on title and abstract: ACM (26), IEEE (0), ScienceDirect (49), and SCOPUS (124).
Relevant for further review (44): ACM (2), IEEE (7), ScienceDirect (9), and SCOPUS (26).
Excluded based on full text: ACM (2), IEEE (2), ScienceDirect (5), and SCOPUS (15).
Final papers (20): ACM (0), IEEE (5), ScienceDirect (4), and SCOPUS (11).]

Figure 2. The selection procedure for final papers
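
The full-text step of the selection procedure can be checked with simple arithmetic. The sketch below uses only the per-database counts reported in Figure 2 for the papers relevant for further review and the papers excluded on full text; the variable names are illustrative.

```python
# Counts per database as reported in Figure 2.
relevant = {"ACM": 2, "IEEE": 7, "ScienceDirect": 9, "SCOPUS": 26}
excluded_full_text = {"ACM": 2, "IEEE": 2, "ScienceDirect": 5, "SCOPUS": 15}

# Papers remaining after the full-text selection, per database.
final = {db: relevant[db] - excluded_full_text[db] for db in relevant}

total_relevant = sum(relevant.values())
total_final = sum(final.values())
```

The totals come to 44 relevant papers and 20 final papers, in agreement with the figure.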


2.2. Bibliometrics analysis 

This study also presents a bibliometric analysis of the documents retrieved at the initial stage of
selection. Bibliometrics is the statistical analysis of books, articles, and other publications.
The analysis uses data on the number and authors of scientific publications, as well as on articles
and the citations in them, to measure the output of individuals, research teams, institutions, and
countries, to identify national and international networks, and to map the development of new fields of science
and technology. The VOSviewer tool visualizes keyword clusters and authors in the NER field and thereby helps to
expand the scope of NER research.
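
The idea behind a keyword co-occurrence network can be sketched in a few lines. The example below only counts how often two author keywords appear in the same paper; it is not VOSviewer's algorithm, which additionally normalizes, lays out, and clusters the network. The keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author-keyword lists, one per paper (illustration only).
papers = [
    ["named entity recognition", "twitter", "crf"],
    ["named entity recognition", "deep learning", "twitter"],
    ["named entity recognition", "crf"],
]

def cooccurrence_counts(keyword_lists):
    """Count how many papers mention each unordered keyword pair."""
    counts = Counter()
    for keywords in keyword_lists:
        # sorted(set(...)) removes duplicates and fixes the order within a pair
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts

links = cooccurrence_counts(papers)
```

Each pair with a nonzero count corresponds to an edge in the network, and the count serves as the edge weight.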


3. RESULTS AND DISCUSSION 
3.1. Bibliometrics analysis results 

VOSviewer software helps to visualize research trends by putting the keywords of articles into
clusters and constructing diagrams from them. Figure 3(a) shows that NER research began to develop in early
2018 (see blue cluster) with content analysis tasks and experiments to identify documents and sentences.
Some of these tasks included comparing the precision and recall of simple ML models such as
conditional random fields (CRF) and support vector machines (SVM). It was not until the beginning of 2019
(see green cluster), however, that research using Indonesian datasets began. Around that time,
several other classification tasks also began to develop. Research data come not only from scholarly
articles but also from data and images on the web and on platforms such as Twitter.
In 2019-2021 (see yellow cluster), NER research started working on fake news,
aspect-based sentiment analysis, and the measurement of model performance. This is where the NER
approach with deep learning (DL) begins to emerge. DL is a form of ML that
aims to imitate the workings of the human brain using artificial neural networks.
It is generally expected to improve on the performance of earlier ML approaches.

Additionally, we conducted a VOSviewer analysis with the co-authorship feature to see which
authors were actively researching NER topics. Of 684 authors, 49 met the threshold; however, the results
show that the researchers are not connected by any network. This suggests that each NER experiment has its own
research goals, dataset, methods/techniques, and part-of-speech (POS) tagging process. As Figure 3(b) shows,
one of the most active researchers in the field is Purwanti.
Figure 3. VOSviewer results from 2016 to 2021: (a) co-occurrence network of author keywords and (b) co-authorship network


3.2. SLR results 
After investigating the VOSviewer results, we examined the 20 articles collected from the ACM, IEEE,
ScienceDirect, and SCOPUS portals. The articles were then extracted and mapped according to author, task,
dataset, and method/technique (see Table 3). Table 3 shows that NER studies with
Indonesian datasets have been carried out for the following tasks: complaint classification [19], quote
identification [9], [20], flood monitoring extraction [7], traffic monitoring [8], [21], tourist attractions [22],
zakat [23], lipstick product reviews [24], and various model combination tests for Twitter [25]-[28], online
news [6], [29]-[31], and Wikipedia [32], [33].


Table 3. SLR results 


Author | Tasks | Dataset | Methods/techniques
Purwanti et al. [19] | Using InaNLP for complaint tweet classification | 7,440 Twitter data | InaNLP: Indonesia natural language processing toolkit
Wibawa et al. [29] | Build Indonesian NER for newspaper articles with 15 classes | 457 online news data from Detiknews.com, Kompas.com, and Mediaindonesia.com | Supervised machine learning in the NER (Naive Bayes, SVM, and simple logistic)
Syaifudin et al. [9] | Identify quotes from Indonesian online news texts | 2,506 sentences from kompas.com, tempo.co, and tribunnews.com | Rule-based method
Taufik et al. [25] | Modelling NER on Indonesian microblog messages | 600 Twitter data | Rule-based method
Munarko et al. [26] | Grouping formal and informal Twitter | 8,000 Twitter data | CRF
Gunawan et al. [32] | Using deep learning to identify Indonesian-language entities | 4,139 sentences from Wikipedia | Hybrid bidirectional long short-term memory (BLSTM) and convolutional neural network (CNN)
Herwanto et al. [21] | Propose an information extraction method to map traffic conditions from tweets | 3,013 Twitter data | Rule-based method
Alifi et al. [8] | Designing model architecture for traffic information extraction | 44,102 Twitter data | Bidirectional LSTM and CNN
Azarine et al. [27] | Build NER on Indonesian-language tweets with hidden Markov model (HMM) and add POS tagging feature extraction | 500 Twitter data | HMM
Leonandya et al. [33] | Apply and evaluate the latest transfer learning techniques | Online news from kompas.com and tempo.co and Indonesian Wikipedia | Deep bidirectional language models and transfer learning
Wintaka et al. [28] | Build a model using a combination of deep learning and machine learning approaches for Twitter data | 250 formal tweets and 350 informal tweets | BLSTM and CRF
Emcha et al. [20] | Extracting quotations from Indonesian online news | 503 standard sentences and 395 indirect quotation sentences from kompas.com, detik.com, tempo.co, tribunnews.com, and antaranews.com | SVM algorithm
Rosyiq et al. [22] | Information extraction of Indonesian tourist attractions | 800 Twitter data | DBpedia ontology
Santoso et al. [6] | Propose a hybrid approach for named entity recognition | 51,241 entities from Indonesian online news | Hybrid CRF and K-Means
Azzahra et al. [34] | Conducting NER research for unstructured text format datasets in Indonesian using a deep learning approach | 500 Twitter data | HMM
Yohanes et al. [35] | Building a framework to build a corpus for extracting Indonesian public figure quotes | Indonesian online newspaper (Kompas Daily) | GloboQuotes, PARC 3.0, PolNeAR, and Quootstrap
Putra et al. [7] | Utilization of social media data to serve as flood monitoring data | 72,212 Twitter data | Naive Bayes, random forest, SVM, logistic regression, and CRF
Sukmana et al. [23] | Building a knowledge graph for zakat, which involves data acquisition, extracting entities and their relationships, mapping to ontologies, and applying knowledge graphs and visualizations | 24 documents with 15,979 words from online sources and four offline data documents | Framework Indonesian-open domain information extractor for processing entity-relationship identification, mapping to ontology, and deploying knowledge graphs
Indarta et al. [24] | Extraction of aspects and opinions on lipstick product reviews | 591 sentence reviews and 8,574 | CRF and HMM
Santoso et al. [30] | Extracting the ontology building concept automatically with NER | 29,587 Indonesian online news articles collected from CNN Indonesia | End-to-end model deployment using BLSTM

3.3. Discussion

The internet, and especially social media, is a strategic tool for disseminating information to the
public. Techniques have recently emerged that allow one to extract information on a targeted topic from the
internet and then to examine the relationships between the words associated with that topic. Moreover, these
techniques allow one to map out the relationships between the chief exponents of that topic and perhaps even
locate them by charting their movements. One technique, namely text mining, provides a set of
methodologies and tools for finding, visualizing, and evaluating information from extensive collections of
text data [36]. Text mining involves four processes (see Figure 4). There are two ways of
collecting data from social media and online news: i) web crawling, performed automatically with an API or a bot;
and ii) web scraping, which extracts HTML or XML elements from pages retrieved over HTTP. After the data are
collected and cleaned, the next stage is pre-processing, which can include tokenization, stopword removal, and
stemming.


[Figure 4: diagram of the text mining process, including a feedback loop.]

Figure 4. The basic process of text mining


At that point, a machine-learning approach to modeling is applied iteratively, looping back until the final
evaluation and validation. Finally, presentations of the data help to visualize the results of modeling after
tasks such as categorization, recommendation, spam detection, and summarization have been completed. As
explained in the introduction, language-analysis toolkits, especially the natural language toolkit
(NLTK), are intended mainly for English. Other languages need to rely on additional tools and cannot fully use
NLTK. NLTK is a library for NLP written in the Python programming language. NLTK supports
tokenization, classification, stemming, tagging, parsing, and semantic reasoning functions. Indonesian-language
libraries include InaNLP, Kateglo, BimaNLP, Indonesian Stemmer, Sastrawi, PySastrawi, and
SentiStrengthID. Table 4 describes the most frequently used Indonesian-language libraries. In addition,
tools and libraries for NER include spaCy, GATE, OpenNLP, CoreNLP, NLTK, and CogCompNLP.


Table 4. Indonesian libraries 


Libraries | Description
InaNLP [19] | An interface for InaNLP and Deeplearning4j’s Word2Vec for Indonesian (Bahasa Indonesia) in the form of a REST API
Kateglo | The Indonesian thesaurus and glossary dictionary with 72,253 dictionary entries, 191,200 glossary entries, 2,012 proverb entries, and 3,423 abbreviations and acronyms
BimaNLP | Repository for Python codes supporting NLP tutorials in Indonesian
Indonesian Stemmer [37] | Stemming Effect Study on Information Search in Indonesian based on Porter Stemmer
Sastrawi | High-quality stemmer library for Indonesian Language (Bahasa)
PySastrawi | Ported from Sastrawi project in PHP to Python
SentiStrengthID [38] | Sentiment Strength Detection in Bahasa Indonesia


NER is one of the first steps in information extraction: it seeks to find entities mentioned in a
text and classify them into predefined categories such as person name, organization, location, time,
value, and percentage [3]. NER is used in many NLP fields and can help address many needs [25], [39], [40].
It is a critical pre-processing tool for various downstream applications such as information retrieval,
question answering, and machine translation. Recognizing named entities in search queries helps to
understand user intent better and thus to provide better search results [41].

It is important to classify the various approaches that NER employs. Although they all carry
out a classification function, new approaches to NER continue to develop. Figure 5 illustrates the
NER approaches. The non-ML approaches to NER fall into four types. First, the rule-based
method identifies entities with rules that are hand-crafted on the basis of linguistic knowledge [9].
Second, the lexicon-based method works by first building a dictionary of opinion words (lexicon). Third, the
statistical method uses probabilistic models such as the CRF and HMM algorithms [24]. A CRF is a framework
for building discriminative probabilistic models for segmenting and labelling sequential data.
An HMM, a primary technique for POS tagging in NLP, models observations using a Markovian
process with a state that is not directly observed (hidden). The main idea of HMM is to solve the problem of
sequence tagging. Fourth is ontology-based NER, which can be used alongside a machine-learning approach. This
method can identify known terms and concepts in unstructured or semi-structured text, but it
relies on the ontology being kept up to date. The ontology approach provides additional advantages for further
reasoning and knowledge acquisition for the extracted concepts [23], [30].
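
To make the lexicon-based idea concrete, the following minimal sketch looks up phrases from a small hand-made gazetteer using greedy longest-match. The gazetteer entries, the labels, and the function name `lexicon_ner` are hypothetical and chosen only to cover the example sentence from the introduction; a real lexicon would be far larger and would need disambiguation rules.

```python
# Hypothetical gazetteer (lexicon): lowercase phrase -> entity label.
GAZETTEER = {
    "james": "PERSON",
    "indonesia": "LOCATION",
    "faculty of computer science": "ORGANIZATION",
}
MAX_PHRASE_LEN = max(len(p.split()) for p in GAZETTEER)

def lexicon_ner(sentence: str):
    """Greedy longest-match lookup of gazetteer phrases in a sentence."""
    tokens = sentence.replace(".", "").replace(",", "").split()
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            label = GAZETTEER.get(phrase.lower())
            if label:
                entities.append((phrase, label))
                i += n
                break
        else:
            i += 1  # no phrase starts here; move to the next token
    return entities

sentence = ("James is a doctoral student in the Faculty of Computer Science "
            "at the University of Indonesia.")
found = lexicon_ner(sentence)
```

The lookup recovers the three entities named in the introduction, but it would miss any name that is absent from the lexicon, which is the main weakness of this family of methods.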

In the field of NLP, researchers are interested in identifying the word class of each word in a
sentence. Consider the sentence Ryan menendang bola (‘Ryan kicks the ball’). After POS tagging,
the classification is “Ryan/noun menendang/verb bola/noun.” This is useful for selecting the nouns in a
sentence. Word classes are also referred to as syntactic categories. POS tagging is a form of sequence
classification.
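
The POS tagging example above can be reproduced with a toy HMM decoded by the Viterbi algorithm. This is a sketch only: the two-tag set and all probabilities are hand-picked assumptions for illustration, whereas a real tagger would estimate them from an annotated corpus.

```python
import math

STATES = ["noun", "verb"]

# Hypothetical probabilities chosen by hand for illustration only.
START = {"noun": 0.8, "verb": 0.2}
TRANS = {
    "noun": {"noun": 0.3, "verb": 0.7},
    "verb": {"noun": 0.8, "verb": 0.2},
}
EMIT = {
    "noun": {"ryan": 0.5, "bola": 0.5},
    "verb": {"menendang": 1.0},
}
UNSEEN = 1e-6  # small probability for word/tag pairs not listed above

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    def emit(tag, word):
        return math.log(EMIT[tag].get(word, UNSEEN))

    # best[t][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{s: (math.log(START[s]) + emit(s, words[0]), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda item: item[1],
            )
            column[s] = (score + emit(s, word), prev)
        best.append(column)

    # Backtrack from the best final state.
    tag = max(STATES, key=lambda s: best[-1][s][0])
    tags = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        tags.append(tag)
    return list(reversed(tags))

sentence = "Ryan menendang bola".lower().split()
tagged = list(zip(sentence, viterbi(sentence)))
```

With these assumed probabilities the decoder yields noun, verb, noun, matching the tagging given in the text.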

There are also several schemes for annotating NER data. Widely used tagging schemes include
inside-outside (IO), inside-outside-beginning (IOB), and beginning-inside-last-outside-unit (BILOU). If two
entities of the same type appear consecutively, IO cannot distinguish the boundary between them. IOB and BILOU,
by contrast, encode boundary information, but they differ in how much
context information they can model [41].
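
The difference between the three schemes is easiest to see on two adjacent entities of the same type. The sketch below converts entity spans into IO, IOB, and BILOU tags; the helper name `tag_spans`, the span format, and the token sequence are assumptions made for this illustration.

```python
def tag_spans(tokens, spans, scheme="IOB"):
    """Tag tokens given entity spans as (start, end_exclusive, label)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                tags[i] = f"I-{label}"
            elif scheme == "IOB":
                tags[i] = ("B-" if i == start else "I-") + label
            elif scheme == "BILOU":
                if end - start == 1:
                    tags[i] = f"U-{label}"   # single-token entity
                elif i == start:
                    tags[i] = f"B-{label}"
                elif i == end - 1:
                    tags[i] = f"L-{label}"
                else:
                    tags[i] = f"I-{label}"
            else:
                raise ValueError(f"unknown scheme: {scheme}")
    return tags

# Two adjacent person names in a hypothetical token sequence.
tokens = ["Budi", "Santoso", "Ani", "tiba"]
spans = [(0, 2, "PER"), (2, 3, "PER")]
io_tags = tag_spans(tokens, spans, "IO")
iob_tags = tag_spans(tokens, spans, "IOB")
bilou_tags = tag_spans(tokens, spans, "BILOU")
```

Under IO the first three tokens all receive `I-PER`, so the two names merge into one; IOB marks the second name with a new `B-PER`, and BILOU additionally marks where each entity ends.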


Figure 5. NER approaches 


In conducting NER experiments, the lack of datasets in Indonesian provides an opportunity for
further research to build new datasets. The important step in building such a dataset is crawling
and scraping. Scraping manually means spending time copying and pasting
data, so we suggest scraping with code, applications, or browser extensions. HTML
parsing techniques can also be performed via JavaScript and can target linear and branching HTML pages. This
method is more efficient at identifying the HTML elements of websites, which are then used to extract text,
links, and data. No scraping technique is one hundred percent effective because the data obtained are
not always tidy, depending on the structure of the page. Understanding the structure of website
pages is therefore essential.
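
As an illustration of HTML parsing for scraping, the sketch below extracts paragraph text and links from a page using only Python's standard `html.parser` module. The class name `TextAndLinkParser` and the page fragment are hypothetical; the fragment stands in for a downloaded news article, and no network access is involved.

```python
from html.parser import HTMLParser

class TextAndLinkParser(HTMLParser):
    """Collect paragraph text and hyperlink targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Keep only text that appears inside a <p> element.
        if self.in_paragraph:
            self.paragraphs[-1] += data

# Hypothetical page fragment standing in for a downloaded news article.
SAMPLE_HTML = """
<html><body>
  <h1>Berita</h1>
  <p>Banjir melanda <a href="/lokasi/jakarta">Jakarta</a> pada hari Senin.</p>
  <p>Lalu lintas macet.</p>
</body></html>
"""

parser = TextAndLinkParser()
parser.feed(SAMPLE_HTML)
```

Which tags carry the article text differs from site to site, so the tag choice here (`p` and `a`) would have to be adapted after inspecting the structure of each target page.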

Second, once the dataset has been obtained, we need to understand the data cleansing approach, including
tokenization, stemming, and stopword removal. Punctuation marks, numbers, and emoticons
are also removed so that the text data are of high quality before analysis. Text preprocessing
turns unstructured text into clean data that are ready to be processed.
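
A minimal cleaning step along these lines might look as follows. The stopword list is a tiny hypothetical sample and the regular expressions are assumptions suited to this example; stemming is left out and would normally be delegated to a dedicated Indonesian stemmer.

```python
import re

# Tiny hypothetical stopword list; real projects would use a fuller list.
STOPWORDS = {"yang", "di", "dan", "ini", "itu"}

def preprocess(text: str):
    """Lowercase, strip URLs, digits, punctuation and emoticons, then tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters and whitespace
    tokens = text.split()                      # whitespace tokenizer
    return [t for t in tokens if t not in STOPWORDS]

raw = "Banjir di Jakarta hari ini!!! 😢 Info: https://example.com 2021"
clean = preprocess(raw)
```

Note that lowercasing discards capitalization, which is itself a useful cue for NER, so in practice the cleaning steps should be chosen with the downstream model in mind.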
Third, when building ML models, the prepared dataset is generally divided into training, development, and
testing sets. Training is the process of building a model from the data, and testing measures the
performance of the learned model. Development sets are generally not used when the dataset is small. Typical
splits are 80% training data and 20% testing data, or 70% training data and 30% testing data. The
NER approach must be selected carefully, as several studies demand a high level of accuracy and a
high F1 score.
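
The split and the evaluation can be sketched without any ML library. In the example below, the helper names, the fixed random seed, and the gold and predicted entities are all hypothetical; the F1 score is the harmonic mean of precision P and recall R, that is, F1 = 2PR / (P + R).

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled[:cut], shuffled[cut:]

def precision_recall_f1(gold, predicted):
    """Entity-level scores; gold and predicted are sets of (text, label) pairs."""
    true_positive = len(gold & predicted)
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# 80% training data and 20% testing data.
train, test = train_test_split(range(100), test_ratio=0.2)

# Hypothetical gold and predicted entities for one test document.
gold = {("James", "PERSON"), ("Indonesia", "LOCATION"), ("UI", "ORGANIZATION")}
pred = {("James", "PERSON"), ("Indonesia", "ORGANIZATION")}
p, r, f1 = precision_recall_f1(gold, pred)
```

Here one of the two predictions is correct (P = 0.5) and one of the three gold entities is found (R = 1/3), which gives F1 = 0.4. An entity with the right text but the wrong label counts as an error.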


3.4. Recommendations for illegal Fintech supervision strategies with the NER approach based on 
social media data and online news 

Based on the SLR, new ideas emerge for using this method in the era of technological and social
media transformation. The digital economy can change the economic activities of society and business from
manual to fully automated. This affects the provision of financial services by startups and
Fintech companies. Fintech practices in Indonesia are now well developed, ranging from payments and
funding to robo-advisors. However, Fintech lending (online lending) has received
special attention because it has caused several problems, notably the emergence of illegal Fintech. Unreasonable
billing processes, issues of personal data protection, and even moral hazards are the focus of supervision.
In Indonesia, a government website is available for illegal Fintech complaints, but people tend to use
social media to submit their complaints [15]. With the NER concepts described in the previous section and the
basics of libraries, POS tagging, and named entities, this research provides a basis for developing ML
models for the early identification of platform names on social media. Figure 6 shows our proposed Fintech
supervision model based on social media data and online news, which can be used in further research.


[Figure 6: flow diagram. Comments on social media -> social media crawling -> data preprocessing ->
NER model testing -> illegal Fintech entity extraction -> list of illegal Fintech.]

Figure 6. Proposed illegal Fintech supervision model with the NER approach
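
The flow in Figure 6 can be mocked up end to end to show how the stages connect. This is a toy sketch, not the proposed model: a cue-word rule stands in for a trained NER model, and the platform names, cue words, licensed list, and comments are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical names throughout: these are not real platforms or real data.
LICENSED_PLATFORMS = {"platforma"}
CUE_WORDS = {"aplikasi", "pinjol"}  # words assumed to precede a platform name

def extract_platforms(comment: str):
    """Toy rule standing in for a trained NER model: take the token after a cue word."""
    tokens = re.findall(r"[a-z0-9]+", comment.lower())
    return [nxt for cur, nxt in zip(tokens, tokens[1:]) if cur in CUE_WORDS]

def candidate_list(comments):
    """Count mentioned platforms that are absent from the licensed list."""
    mentions = Counter()
    for comment in comments:
        for name in extract_platforms(comment):
            if name not in LICENSED_PLATFORMS:
                mentions[name] += 1
    return mentions.most_common()

comments = [
    "Saya diteror penagih dari aplikasi PlatformX!",
    "Hati-hati dengan pinjol PlatformX, data saya disebar.",
    "Pengalaman baik memakai aplikasi PlatformA.",
]
flagged = candidate_list(comments)
```

The output is a ranked list of candidate names for human review rather than a verdict; whether a platform is actually illegal would still have to be confirmed against official records.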


4. CONCLUSION 

In conclusion, this study has provided an overview of research trends in applying the NER method
to Indonesian datasets, including extracting news articles, flood monitoring, traffic monitoring, and quotation
identification. Other areas of research to consider are data collection, dataset construction, data cleaning, and
the selection of ML models for NER tasks. The theoretical implication of this research is a clearer
understanding of the concept of NER and its application, including identifying active researchers and comparing
the NER methods used. The practical implication is that the NER approach can be used to extract social media
comments for platform entity detection, in particular to develop the proposed illegal
Fintech supervision model from social media data.

This survey is limited by the low number of articles reviewed, which reflects the lack of
research using Indonesian datasets. This is an opportunity for further research in developing models and
libraries that use Indonesian datasets. In the field of computational linguistics, the grammatical structure of
each language will be a consideration and a challenge that can be explored in future research.